NLP-CIC @ PRELEARN: Mastering Prerequisites Relations, from Handcrafted Features to Embeddings
We present our systems and findings for the prerequisite relation learning task (PRELEARN) at EVALITA 2020. The task aims to classify whether a pair of concepts holds a prerequisite relation or not. We model the problem using handcrafted features and embedding representations for in-domain and cross-domain scenarios. Our submissions ranked first in both scenarios, with average F1 scores of 0.887 and 0.690, respectively, across domains on the test sets. We have made our code freely available.
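The pairwise setup can be sketched as follows. This is a minimal illustration of classifying a concept pair with handcrafted features; the toy "pages", the specific features, and the hand-set weights are assumptions for the sketch, not the features or classifier of the NLP-CIC system.

```python
# Sketch: is concept a a prerequisite of concept b?
# Features and weights below are illustrative assumptions.

def extract_features(a, b, pages):
    """Handcrafted pairwise features for a candidate relation a -> b."""
    text_a, text_b = pages.get(a, ""), pages.get(b, "")
    return [
        1.0 if a.lower() in text_b.lower() else 0.0,  # a mentioned in b's page
        1.0 if b.lower() in text_a.lower() else 0.0,  # b mentioned in a's page
        len(text_a) / (len(text_b) + 1.0),            # relative page length
    ]

def predict(features, weights, bias=0.0):
    """Linear scorer standing in for a trained classifier."""
    score = bias + sum(w * f for w, f in zip(weights, features))
    return 1 if score > 0 else 0

pages = {
    "derivative": "Defined via limits of difference quotients.",
    "limit": "A limit is a fundamental notion used to define the derivative.",
}
feats = extract_features("limit", "derivative", pages)
# Hand-set weights: "a mentioned in b's page" is treated as strong evidence.
label = predict(feats, weights=[2.0, -0.5, 0.1], bias=-1.0)
print(label)  # 1 -> "limit" predicted as a prerequisite of "derivative"
```

In practice the linear scorer would be replaced by a classifier trained on labelled concept pairs, and the features would include embedding similarities alongside handcrafted signals.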
Collective moderation of hate, toxicity, and extremity in online discussions
How can citizens moderate hate, toxicity, and extremism in online discourse?
We analyze a large corpus of more than 130,000 discussions on German Twitter
over the turbulent four years marked by the migrant crisis and political
upheavals. With the help of human annotators, language models, machine learning
classifiers, and longitudinal statistical analyses, we discern the dynamics of
different dimensions of discourse. We find that expressing simple opinions, not
necessarily supported by facts but also without insults, relates to the least
hate, toxicity, and extremity of speech and speakers in subsequent discussions.
Sarcasm also helps in achieving those outcomes, in particular in the presence
of organized extreme groups. More constructive comments such as providing facts
or exposing contradictions can backfire and attract more extremity. Mentioning
either outgroups or ingroups is typically related to a deterioration of
discourse in the long run. A pronounced emotional tone, either negative such as
anger or fear, or positive such as enthusiasm and pride, also leads to worse
outcomes. Going beyond one-shot analyses on smaller samples of discourse, our
findings have implications for the successful management of online commons
through collective civic moderation.
LEIA: Linguistic Embeddings for the Identification of Affect
The wealth of text data generated by social media has enabled new kinds of
analysis of emotions with language models. These models are often trained on
small and costly datasets of text annotations produced by readers who guess the
emotions expressed by others in social media posts. This affects the quality of
emotion identification methods due to training data size limitations and noise
in the production of labels used in model development. We present LEIA, a model
for emotion identification in text that has been trained on a dataset of more
than 6 million posts with self-annotated emotion labels for happiness,
affection, sadness, anger, and fear. LEIA is based on a word masking method
that enhances the learning of emotion words during model pre-training. LEIA
achieves macro-F1 values of approximately 73 on three in-domain test datasets,
outperforming other supervised and unsupervised methods in a strong benchmark
that shows that LEIA generalizes across posts, users, and time periods. We
further perform an out-of-domain evaluation on five different datasets of
social media and other sources, showing LEIA's robust performance across media,
data collection methods, and annotation schemes. Our results show that LEIA
generalizes its classification of anger, happiness, and sadness beyond the
domain it was trained on. LEIA can be applied in future research to provide
better identification of emotions in text from the perspective of the writer.
The models produced for this article are publicly available at
https://huggingface.co/LEI
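The word-masking idea during pre-training can be sketched as below: emotion words are masked more aggressively than other tokens, so the model is pushed to predict them from context. The tiny lexicon and the masking probabilities are illustrative assumptions, not LEIA's actual configuration.

```python
import random

# Sketch of preferential masking of emotion words for MLM-style pre-training.
# Lexicon and probabilities are illustrative assumptions, not LEIA's values.
EMOTION_LEXICON = {"happy", "sad", "angry", "afraid", "love"}

def mask_tokens(tokens, p_emotion=0.5, p_other=0.15, seed=0):
    """Replace tokens with [MASK], masking emotion words more often."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        p = p_emotion if tok.lower() in EMOTION_LEXICON else p_other
        out.append("[MASK]" if rng.random() < p else tok)
    return out

masked = mask_tokens("today i feel happy and grateful".split())
print(masked)
```

Averaged over many pre-training batches, emotion words end up masked several times more often than ordinary tokens, which is what biases the learned representations toward affect.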
A Unified System for Aggression Identification in English Code-Mixed and Uni-Lingual Texts
Wide usage of social media platforms has increased the risk of aggression,
which results in mental stress and negatively affects people's lives through
psychological agony, fighting behavior, and disrespect toward others. The
majority of such conversations contain code-mixed languages [28]. Additionally,
the way thoughts are expressed and the communication style change from one
social media platform to another (e.g., communication styles differ between
Twitter and Facebook). All of this increases the complexity of the problem. To
solve these problems, we introduce a unified and robust multi-modal deep
learning architecture that works for both an English code-mixed dataset and a
uni-lingual English dataset. The devised system uses psycho-linguistic features
and very basic linguistic features. Our multi-modal deep learning architecture
contains a Deep Pyramid CNN, a Pooled BiLSTM, and a Disconnected RNN (each with
both GloVe and FastText embeddings). Finally, the system makes its decision
based on model averaging. We evaluated our system on the English code-mixed
TRAC 2018 dataset and a uni-lingual English dataset obtained from Kaggle.
Experimental results show that our proposed system outperforms all previous
approaches on both the English code-mixed dataset and the uni-lingual English
dataset.
Comment: 10 pages, 5 Figures, 6 Tables, accepted at CoDS-COMAD 202
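The model-averaging decision can be sketched as follows: each model emits class probabilities, the system averages them, and the argmax wins. The three hard-coded probability vectors are illustrative stand-ins, not real outputs of the paper's models.

```python
# Sketch of decision by model averaging (soft voting): average the
# class-probability vectors of several models and pick the argmax.
# The hard-coded "model outputs" below are illustrative assumptions.

def average_predict(prob_lists):
    """prob_lists: one probability vector per model, same class order."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c]), avg

# Classes: 0 = non-aggressive, 1 = covertly aggressive, 2 = overtly aggressive
model_outputs = [
    [0.2, 0.5, 0.3],   # e.g. Deep Pyramid CNN
    [0.1, 0.6, 0.3],   # e.g. Pooled BiLSTM
    [0.3, 0.3, 0.4],   # e.g. Disconnected RNN
]
label, avg = average_predict(model_outputs)
print(label, avg)  # class 1 wins with average probability ~0.467
```

Soft voting of this kind tends to be more stable than majority voting of hard labels, since a model that is unsure contributes a flat vector rather than a full vote.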
EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020
Welcome to EVALITA 2020! EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for Italian. EVALITA is an initiative of the Italian Association for Computational Linguistics (AILC, http://www.ai-lc.it) and is endorsed by the Italian Association for Artificial Intelligence (AIxIA, http://www.aixia.it) and the Italian Association for Speech Sciences (AISV, http://www.aisv.it).
Leveraging label hierarchy using transfer and multi-task learning: A case study on patent classification
When labels are organized into a meaningful taxonomy, the parent-child relationship between labels at different levels can give the classifier additional information not deducible from the data alone, especially with limited training data. As a case study, we illustrate this effect on the task of patent classification—the task of categorizing patent documents based on their technical content. Existing approaches do not take this additional information into consideration. Experiments on two patent classification datasets, WIPO-alpha and USPTO-2M, show that our regularized Gated Recurrent Unit (GRU) architecture already gives a performance improvement, with a micro-averaged precision score using the top prediction of 0.5191 and 0.5740 on the two datasets, respectively. However, knowledge transfer along the label hierarchy gives a further significant improvement on WIPO-alpha, raising the score to 0.5376, and a small improvement on USPTO-2M, to 0.5743. Our analyses reveal that incorporating label information improves performance on classes with fewer examples and makes the model robust to errors that result from predicting closely related labels.
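One convenient property of patent taxonomies is that IPC-style codes encode their own ancestry, so parent labels for transfer or multi-task learning can be read directly off the leaf code. A minimal sketch, assuming IPC-like codes (this is not the paper's actual training pipeline):

```python
# Sketch: derive the full label path (section -> class -> subclass) from an
# IPC-like subclass code. Codes below are illustrative examples.

def hierarchy_labels(ipc_code):
    """Split an IPC-like subclass code, e.g. 'G06F', into ancestor labels."""
    section = ipc_code[0]        # 'G'   (section, e.g. Physics)
    ipc_class = ipc_code[:3]     # 'G06' (class, e.g. Computing)
    subclass = ipc_code[:4]      # 'G06F'
    return section, ipc_class, subclass

print(hierarchy_labels("G06F"))  # ('G', 'G06', 'G06F')
```

With labels expanded this way, a model can first be trained (or jointly trained) to predict the coarser section and class targets, then transfer that knowledge to the fine-grained subclass task, which is the kind of hierarchy exploitation the abstract describes.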
Social media sharing of low quality news sources by political elites
Increased sharing of untrustworthy information on social media platforms is one of the main challenges of our modern information society. Because information disseminated by political elites is known to shape citizen and media discourse, it is particularly important to examine the quality of information shared by politicians. Here, we show that from 2016 onward, members of the Republican Party in the US Congress have been increasingly sharing links to untrustworthy sources. The proportion of untrustworthy information posted by Republicans versus Democrats is diverging at an accelerating rate, and this divergence has worsened since President Biden was elected. This divergence between parties seems to be unique to the United States, as it cannot be observed in other Western democracies such as Germany and the United Kingdom, where left–right disparities are smaller and have remained largely constant.